[Question] Sed and regex string manipulation
-
[email protected]replied to [email protected] last edited by
Hello! I will take a look at it, I just haven't had a chance over the last day. Give me a couple days and I will give some feedback. Bear in mind I am not an expert, so I might not have much to offer, but I'll share what I can.
-
[email protected]replied to [email protected] last edited by
This might work, but I think it is best to not tinker further if you already have a working script (especially one that you understand and can modify further if needed).
perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'
-
[email protected]replied to [email protected] last edited by
I did it!! It also handles the case where an external link and internal link are on the same line
sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
Here is my annotated file
# Begin loop
:l;

# Bisect first link in pattern space into pattern space and append to hold space
# Example: `text [label](file#fragment)'
#   Pattern space: `file#fragment)'
#   Hold space: `text [label]('
# Steps:
#   1. Strategically insert \n
#   1a. If this fails, branch out
#   2. Append to hold space (this creates two \n's. It feels weird for the
#      first iteration, but that's ok)
#   3. Copy hold space to pattern space, remove first \n, then trim off
#      everything past the second \n
#   4. Swap pattern/hold, and trim off everything up to and incl the last \n
s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;
Te;
H;
g;
s/\n//;
s/\n.*//;
x;
s/.*\n//;

# Modify only if it is an internal link
/^https?:/! {
  # Add hyphens
  :h;
  s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;
  th;

  # Make lowercase
  s/(#[^)]*\))/\L\1/;
};

# "conditional" branch so it checks the next conditional again
tl;

# Exit: join pattern space to hold space, then move to pattern space.
# Since the loop uses H instead of h, have to make sure hold space is empty
:e;
H;
z;
x;
s/\n//;
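To see the same-line case in action, here is a small sketch that runs the one-liner on a line containing both an internal and an external link. The wrapper name fix_fragments_sed and the sample line are made up for illustration; the sed script is copied unchanged. It assumes GNU sed, since T, z and \L are GNU extensions.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed one-liner posted above (GNU sed assumed).
fix_fragments_sed() {
  sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
}

# Sample input (made up): an internal and an external link on one line.
printf '%s\n' \
  'See [one](notes.md#My%20Heading) and [two](https://example.com/#My%20Heading).' |
  fix_fragments_sed
# Prints:
# See [one](notes.md#my-heading) and [two](https://example.com/#My%20Heading).
```

Only the fragment of the internal link is rewritten; the file name before the # and the whole external link pass through untouched.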
-
[email protected]replied to [email protected] last edited by
Wow! Thank you! I ran a quick test on a test-file.md:
[Just a test](#just-a-test)
[Just a link](https://mylink/%20with%20space.com)
[External link](readme.md#just-a-test)
[Link with numbers](readme.md#1-3-this-is-another-test)
[Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test)
Great job! Thank you very much! I'm really impressed by what someone with proper knowledge can do! However, I really do not want to mess around with your regex... That would only invite disaster xD! I will carefully keep your regex and annotated file in my knowledge base; I'm sure I will come back to it at some point and try to break it down as a learning exercise.
Thank you very much !!!
-
[email protected]replied to [email protected] last edited by
Hello
Sorry to ping you, I just gave pandoc a try but it doesn't work, and I had to dig a bit further on the web to find out why!
Links to headings with spaces are not specified by CommonMark, and each tool implements a different approach... Most replace spaces with hyphens, others use URL encoding (%20). So even though pandoc looks awesome, it doesn't work for my use case (or did I miss something? Feel free to comment).
You can give it a try on https://pandoc.org/try/ with commonmark to gfm:
[Just a test](#Just a test)
[Just a link](https://mylink/%20with%20space.com)
[External link](Readme.md#JUST%20a%20test)
[Link with numbers](readme.md#1.3%20this%20is%20another%20test)
[Link with numbers](Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test)
If you prefer a CLI version:
pandoc --from=commonmark_x --to=gfm+gfm_auto_identifiers "/home/user/Documents/test.md" -o "pandoc_test.md"
-
[email protected]replied to [email protected] last edited by
Thank you! It actually ticks every use case (for my files) and looks pretty rad!
This might work, but I think it is best to not tinker further if you already have a working script (especially one that you understand and can modify further if needed).
I totally agree, but I will keep your regex as a reference. In the near future I will try to break down your regex as a learning exercise, but it looks very complex!
Another user came up with the following solution:
sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
Just as a little experiment, if you want to spend some time and give me an answer: what do you think? It's another way to achieve the same kind of result, but the two are significantly different. I know there are a thousand ways to achieve the same result, but I'm kind of curious how it looks through an expert's eyes :).
Thanks again for your help and the time you took to write up a complex regex for my use case !
-
[email protected]replied to [email protected] last edited by
Hey take your time
Don't worry even if you forget, you did more than enough to help some random on the web! Two other users came up with plain/bare-bones regex solutions, if you want to have a look; maybe there's something you can learn from them? (I doubt it xD).
Plain sed regex (https://lemmy.ml/post/25346014/16453351)
sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
Plain Perl regex (https://lemmy.ml/post/25346014/16453161)
perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'
Nonetheless, I really prefer your solution because, as someone else said, I will have an easier time changing a script I "understand". So thanks again!
-
[email protected]replied to [email protected] last edited by
Hey, I just did a quick web search and found this. I haven't used the tool specifically before. However, I recommend either searching the web for a similar tool or using a ChatGPT-like tool to create a Python script that'll achieve your end result. Sed and regex are cool and useful, but they're only going to make it more difficult to achieve what you need.
-
[email protected]replied to [email protected] last edited by
No problem. I think this is a great "final boss" question for learning sed, because it turns out it is deceptively hard!! You have to understand not only a lot about regex, but about sed to get it right. I learned a lot about sed just by tackling this problem!
I really do not want to mess around with your regex
It is very delicate for sure, but one part you can for sure change is the # Add hyphens part. In the regex you can see (%20|\.). This is a list of "characters" which get converted to hyphens. For example, you could modify it to (%20|\.|\+) and it will convert +s to -s as well!
Still it is not perfect:
- If the link spans multiple lines, the regex won't match
- If the link contains escaped characters like \\\\\[LINK](#LINK) or [LINK\]\\\\](#LINK)
- If the link is inside a code block ``` it will get changed (which may or may not be intended)
But for a sed-only solution this is about as good as it will get I'm afraid.
Overall I'm very happy with it. Someday I would like to make a video that goes into depth about sed, since it is tricky to learn just from the docs.
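As a rough sketch of the (%20|\.|\+) change described above, here is the one-liner with only that alternation edited, plus a link split across two lines to show the first limitation from the list. The wrapper name fix_fragments_plus and the sample input are made up for illustration, and GNU sed is assumed.

```shell
#!/bin/sh
# Hypothetical variant of the one-liner: only the alternation in the
# "Add hyphens" step is changed, from (%20|\.) to (%20|\.|\+).
# GNU sed is assumed (T, z and \L are GNU extensions).
fix_fragments_plus() {
  sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.|\+)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
}

# Sample input (made up): a fragment with a + and a %20, then a link
# that is split across two lines.
printf '%s\n' \
  '[Notes](notes.md#My+Heading%20Two)' \
  '[Split](notes.md#My%20' \
  'Heading)' |
  fix_fragments_plus
# Prints:
# [Notes](notes.md#my-heading-two)
# [Split](notes.md#My%20
# Heading)
```

The split link comes out unchanged because sed reads one line at a time, so the closing parenthesis is never in the pattern space together with the opening bracket.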
-
[email protected]replied to [email protected] last edited by
Well, I'm not going to even try understanding the various features used in that sed command. I do know how to use basic loops with labels, but I never bothered with all the buffer manipulation stuff. I'd rather use awk/perl/python for those cases.